{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Scraping two YouTube accounts using python libraries.ipynb",
      "version": "0.3.2",
      "provenance": [],
      "collapsed_sections": [],
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/Tanu-N-Prabhu/Python/blob/master/Data%20Scraping%20from%20the%20Web/Scraping_two_YouTube_accounts_using_python_libraries.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rE07fxu3MLOe",
        "colab_type": "text"
      },
      "source": [
        "# Scraping two YouTube accounts using python libraries."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dCkSzp06MPjz",
        "colab_type": "text"
      },
      "source": [
        "## Let us master the art of scraping two YouTube accounts and report their subscriber's difference."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fSTAtR8JMWJ4",
        "colab_type": "text"
      },
      "source": [
        "![alt text](https://assets.entrepreneur.com/content/3x2/2000/20180117155526-youtube.jpeg?width=700&crop=2:1)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "djUxccO6Mbjx",
        "colab_type": "text"
      },
      "source": [
        "**In this article, we will use the python requests library and BeautifulSoup to scrape raw data from unrefined HTML source code.**\n",
        "\n",
        "The most useful libraries required for web scraping are:\n",
        "\n",
        "1. [Beautiful Soup.](https://www.crummy.com/software/BeautifulSoup/bs4/doc/?source=post_page---------------------------)\n",
        "\n",
        "2. [Requests](https://2.python-requests.org/en/master/?source=post_page---------------------------).\n",
        "\n",
        "Let's get started, this tutorial is divided into four parts each part is listed down below:"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "skRTv_iWZUVQ"
      },
      "source": [
        "\n",
        "\n",
        "I have chosen to scrape two YouTube accounts, named \"[PewDiePie](https://www.youtube.com/channel/UC-lHJZR3Gqxm24_Vd_AJ5Yw)\" and \"[T-Series](https://www.youtube.com/user/tseries),\" and report the subscriber difference. I have been a diehard fan of PewDiePie for the past 4 years, and he was the number one YouTuber in terms of his subscribers, up until the last six to ten months. At this point, T-series arrived and eventually outperformed and dethroned PewDiePie, setting a record and becoming the most subscribed channel (apart from YouTube’s own music channel). I did not like T-Series taking over, or how the sub-gap between T-Series and PewDiePie continues to increase drastically. In addition, this subscription gap was one of the biggest controversies in YouTube’s history, which is another reason why I chose to scrape this data. I have a strong feeling that eventually PewDiePie will surpass T-Series again, and remain as the number one YouTuber forever. Scraping these accounts and determining the subscriber gap will allow me to state with accuracy how close PewDiePie is to regain the title of the most subscribed YouTuber.\n",
        "\n",
        "---\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "03SLx2EUKm6K",
        "colab_type": "text"
      },
      "source": [
        "To find the HTML tags, you must locate both \"[PewDiePie](https://www.youtube.com/channel/UC-lHJZR3Gqxm24_Vd_AJ5Yw)\" and \"[T-Series](https://www.youtube.com/user/tseries),\" YouTube channels. After locating the channels find for the total numbers of subscribers on the page, which would be displayed right below their respected channel names. When you get this right-click on that subscribers and then click on Inspect option or (Ctrl-Shift-I), a new page along with the selected tag will open to your right next to the page. And there you go you can see the HTML tag and the element. Below I have included a screenshot of PewDiePie's channel along with the tags. The same continues to T-Series's channel as it is a mechanical process, follow the same to do it, so which is why I just included only PewDiePie's channel.\n",
        "\n",
        "<center><img src=\"http://i63.tinypic.com/2nkpet1.png\" width = 900 height = 500></center>\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3jcNfFy6Qx20",
        "colab_type": "text"
      },
      "source": [
        "In the above screenshot the tags are \"yt-formatted-string\" it's a custom YouTube tag, unfortunately, this doesn't work. I mean that when I scraped this the soup.findAll method returned an empty pair of brackets ([ ]) and I don't know why maybe it's because of the custom tag and not the regular HTML tag. So, solving this, I just had to look at what is actually being scraped meaning I tried to see what BeautifulSoup is actually returning then I got the answer, soup is actually printing a bit different from the raw HTML, then I searched for the keyword i.e. Subscribers (96, 197, 730) inside the soup using Ctrl-F and then I got the actual HTML tags and the elements. Shown below is the final screenshot of the problem. The actual HTML tag is span and the class is \"yt-subscription-button-subscriber-count-branded-horizontal subscribed yt-uix-tooltip.\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Dx1judwfQcqK",
        "colab_type": "text"
      },
      "source": [
        "\n",
        "\n",
        "<center><img src=\"http://i63.tinypic.com/s6850m.png\"  width = 800 height = 500></center>\n",
        "\n",
        "\n",
        "---\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UsLBhRUWqj8I",
        "colab_type": "text"
      },
      "source": [
        "Let us import few important libraries such as Requests and BeautifulSoup. Let us store the URL of the professor in the variable named \"url\" and \"url2\". The URL of the website can be found here: \"PewDiePie\" and \"T-Series\". Here we use the requests library by passing \"url\" and \"url2\" as a parameter, be careful don't run this multiple times. If you get like Response 200 then its success, if you get something else then there is something wrong with maybe the code or your browser I don't know.\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "NAW65Je1bYoq",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import requests\n",
        "from bs4 import BeautifulSoup"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "rsaM5Djub7bv",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "url = 'https://www.youtube.com/user/PewDiePie'"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "7qJNC1T7qdck",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "url2 = 'https://www.youtube.com/user/tseries'"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "7bxta4rTcFhJ",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "page = requests.get(url)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "XvihGXapqfhq",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "page2 = requests.get(url2)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "B5Ad-3jbcMvK",
        "colab_type": "code",
        "outputId": "14221158-48c2-47b9-8f9b-2ffb0f3deda9",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        }
      },
      "source": [
        "print(page)"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "<Response [200]>\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "5XFASJ85-C6e",
        "colab_type": "code",
        "outputId": "55746acd-144d-4e2b-e8f4-b4adb2e9e4e4",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        }
      },
      "source": [
        "print(page2)"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "<Response [200]>\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GMpgAq-bamcp",
        "colab_type": "text"
      },
      "source": [
        "Next, we use the BeautifulSoup library by passing the page.text as a parameter and using the HTML parser. You can try to print the soup, but printing the soup doesn't give you the answer, rather it contains huge chunks of HTML data, so I decided not to show it here. You can then copy the HTML tag and class if any, and then place it inside the soup.findAllmethod."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "n6MzwvRkcO5A",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "soup = BeautifulSoup(page.text, \"html.parser\")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "CSHm0Bv2qnYM",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "soup2 = BeautifulSoup(page2.text, \"html.parser\")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "tIWAyJkWpNIi",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "pew = soup.findAll(\"span\", {\"class\": \"yt-subscription-button-subscriber-count-branded-horizontal subscribed yt-uix-tooltip\"})"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "DY4RcghdrOIO",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "tseries = soup2.findAll(\"span\", {\"class\": \"yt-subscription-button-subscriber-count-branded-horizontal subscribed yt-uix-tooltip\"})"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "9LDpE0aNpYki",
        "colab_type": "code",
        "outputId": "59f690f0-83f2-43d5-a82b-d6c643116034",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 55
        }
      },
      "source": [
        "print(pew)"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "[<span aria-label=\"96,197,730 subscribers\" class=\"yt-subscription-button-subscriber-count-branded-horizontal subscribed yt-uix-tooltip\" tabindex=\"0\" title=\"96,197,730\">96,197,730</span>]\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "UHCU_7F2rXTO",
        "colab_type": "code",
        "outputId": "f90d79fc-f29a-4c83-f457-be95126d1b60",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 55
        }
      },
      "source": [
        "print(tseries)"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "[<span aria-label=\"100,057,202 subscribers\" class=\"yt-subscription-button-subscriber-count-branded-horizontal subscribed yt-uix-tooltip\" tabindex=\"0\" title=\"100,057,202\">100,057,202</span>]\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "h3e556wUapLv",
        "colab_type": "text"
      },
      "source": [
        "Here we remove all the HTML tags and convert it to a text format, this can be done with the help of get_text method placed inside a for loop. This converts the HTML into the text format"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "8q_M7xh2pd1K",
        "colab_type": "code",
        "outputId": "c9040f4a-333c-4d77-ffd2-fe65bcbffd8b",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        }
      },
      "source": [
        "for subs in pew:\n",
        "  print(subs.get_text())"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "96,197,730\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "POQvvW7ZrcJG",
        "colab_type": "code",
        "outputId": "043b31a8-2af8-4ee6-9933-4193c45cc6bd",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        }
      },
      "source": [
        "for subs1 in tseries:\n",
        "  print(subs1.get_text())"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "100,057,202\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0z0Nt32K9YBP",
        "colab_type": "text"
      },
      "source": [
        "\n",
        "\n",
        "---\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-vNt72EFiujK",
        "colab_type": "text"
      },
      "source": [
        "Formatting the scraped data was not an easy task because the values that were returned were a series of numbers separated by commas. Puzzled by this formatting issue, I used the website [stack overflow](https://stackoverflow.com/questions/16233593/how-to-strip-comma-in-python-string?source=post_page---------------------------), which helped me through one of their forums. In that forum, they discussed replace (): gets rid of the commas, and the format (): helps in formatting the values. This removal of commas must be done because it was difficult to perform subtraction with commas. Additionally, these values were of type string and they needed to be explicitly converted to an integer by typecasting because it is not possible to subtract two strings.\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "0wPFB_hru04Y",
        "colab_type": "code",
        "outputId": "e0aac313-d6ac-484e-b74a-995e4cc7abfa",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        }
      },
      "source": [
        "pewdiepie = subs.get_text().replace(\",\", \"\")\n",
        "pewdiepie"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'96197730'"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 36
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "qFFWPirau8dO",
        "colab_type": "code",
        "outputId": "737dd6bd-a18d-41e7-cc85-a1bcdc3cb0d0",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        }
      },
      "source": [
        "tseries = subs1.get_text().replace(\",\", \"\")\n",
        "tseries"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'100057202'"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 37
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "cHAaluvAs9Ii",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "difference = int(tseries) - int(pewdiepie)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "7Fw-3DstvQNs",
        "colab_type": "code",
        "outputId": "0f46c9c4-8750-47be-9afd-24fc194e347f",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        }
      },
      "source": [
        "difference"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "3859472"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 39
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "GCmc8fUqjbCC",
        "colab_type": "code",
        "outputId": "9c4bf918-fc7b-4bc8-8cac-465585636f59",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        }
      },
      "source": [
        "print(\"The sub gap between T-series and PewDiePie is  ==> \"'{:,}'.format(difference))"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "The sub gap between T-series and PewDiePie is  ==> 3,859,472\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "U2ykuTAZa550",
        "colab_type": "text"
      },
      "source": [
        "After typecasting, it was a one-step process of subtracting the two variables which contained integers in them. Hence, this is how I successfully obtained the sub difference by scraping the two YouTube channels."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0sQwCsvna7jJ",
        "colab_type": "text"
      },
      "source": [
        "Thank you guys for spending your time reading my tutorial, stay tuned for more updates. Let me know what is your opinion about this tutorial in the comment section below. Also if you have any doubts regarding the code, comment section is all yours. Have a nice day."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WbippKuA9auY",
        "colab_type": "text"
      },
      "source": [
        "\n",
        "\n",
        "---\n",
        "\n"
      ]
    }
  ]
}
